A program executing on a first processor in an MP configuration awaiting the release of a resource held by another processor, detects the expiration of a fixed time interval, and initiates a hierarchy of recovery actions designed to cause the resource to be freed. These actions, targeted at a processor believed to be the one currently holding the resource, are taken only if that processor is not executing an "exempt" routine. The actions, taken in order of increasing severity, are: wait for a second fixed time interval; terminate the routine on the resource-holding processor, allowing retry; terminate the routine without allowing retry; invoke Alternate CP Recovery. The hierarchy is escalated against the target processor until that processor releases the resource, and against other processors in the configuration until the resource is acquired by the first processor. These actions may proceed in parallel for multiple detecting and target processors within an MP environment.



[Fig. 1: time flow for CP 1 (GET LOCK X, 101; DISABLED LOOP, 102; RELEASE LOCK X, 103) and CP 2 (GET LOCK X; excessive spin detected; EXCESSIVE SPIN processing, see Fig. 2; GET LOCK X).]



European Patent Office

EUROPEAN SEARCH REPORT

Application Number: EP 89 11 0328

DOCUMENTS CONSIDERED TO BE RELEVANT

Citation of document with indication, where appropriate, of relevant passages:

1984 INT. CONF. ON INDUSTRIAL ELECTRONICS, CONTROL AND INSTRUMENTATION, IECON '84, vol. 2, 22 October 1984, TOKYO, JP, pages 1169 - 1176; T.C. YANG ET AL.: 'A reliable multi-processor system'
* page 1172, right column, line 40 - page 1173, left column, line 10 *
Relevant to claim: 1,7

IBM TECHNICAL DISCLOSURE BULLETIN, vol. 16, no. 7, December 1973, NEW YORK, US, pages 2420 - 2422; P.H. GUM ET AL.: 'Locking architecture in a multiple virtual memory multiprocessing system'
* page 2420, line 33 - line 36 * * page 2422, line 9 - line 13 *
Relevant to claim: 1,7

PROCEEDINGS OF THE FALL JOINT COMPUTER CONFERENCE, December 1968, SAN FRANCISCO, US, pages 39 - 53; A.N. HIGGINS: 'Error recovery through programming'
* page 42, left column, line 20 - right column, line 26 *
Relevant to claim: 1,4-10

CLASSIFICATION OF THE APPLICATION (Int. Cl. 5): G 06 F 9/46, G 06 F 11/00

TECHNICAL FIELDS SEARCHED (Int. Cl. 5): G 06 F

The present search report has been drawn up for all claims.

Place of search: The Hague
Date of completion of search: 02 December 91
Examiner: ADMINISTRATION







Europäisches Patentamt
European Patent Office
Office européen des brevets

Publication number: 0 351 536 A2

EUROPEAN PATENT APPLICATION

Application number: 89110328.5
Date of filing: 08.06.89
Int. Cl.5: G06F 9/46

Priority: 19.07.88 US 221169
Date of publication of application: 24.01.90 Bulletin 90/04
Designated Contracting States: DE FR GB

Applicant: International Business Machines Corporation
Old Orchard Road
Armonk, N.Y. 10504 (US)

Inventor: Daly, James Clifford
RR2, Box 46A South Road
Millbrook New York 12545 (US)
Inventor: Nick, Jeffrey Mark
43 Plymouth Road
Fishkill New York 12524 (US)
Inventor: Rodegeb, Franklin John
62 Kent Road
Wappingers Falls New York 12590 (US)

Representative: Jost, Ottokar, Dipl.-Ing.
IBM Deutschland GmbH Patentwesen und Urheberrecht Schönaicher Strasse 220
D-7030 Böblingen (DE)

Systematic recovery of excessive spin loops in an n-way mp environment



SYSTEMATIC RECOVERY OF EXCESSIVE SPIN LOOPS IN AN N-WAY MP ENVIRONMENT

This invention relates to the field of systems programming. More specifically, it relates to mechanisms for detecting and recovering from spin loops in multiprocessor system configurations.

A spin loop is a condition which occurs in a multiprocessor (MP) system when a routine executing on one Central Processor (CP) is unable to complete a function due to a dependence on some action being taken on another CP. If the function must be completed before further processing can be performed, the routine may enter a loop and spin, waiting for the required action to be taken on the other CP.

Spin loops typically occur in systems such as MVS/XA and MVS/ESA when a system routine is attempting to perform one of the following functions:

1. Communicate with other CPs - For example, when an MVS system routine running on one CP determines that an address space should be swapped out of main storage, it is necessary to notify all other CPs to purge their translation lookaside buffers of addresses related to that address space. This is accomplished by issuing a SIGP (Signal Processor) Emergency Signal to the other CPs. Until each CP responds with an indication that it has performed the required purge, the initiating MVS routine will enter a spin loop to await completion of the required action.

2. Serialization of function across all CPs - MVS uses system locks to serialize execution of many functions across all of the CPs in the system. This is necessary to ensure the integrity of the operation being performed. The general locking architecture used in the MVS system is described in the IBM Technical Disclosure Bulletin, Dec. 1973, Volume 16, No. 7, at page 2420. As an example, if an MVS routine on one CP wishes to process the results of an I/O interrupt from a device, it must ensure that status about the interrupt is not inadvertently corrupted by a system routine on another CP wishing to initiate a new I/O operation to the device. This is accomplished via the use of a system lock per device. If a system routine requires the lock for a given device which is owned by a routine on another CP, it will enter a spin loop until the lock becomes available.

Spin loops are a normal phenomenon of an MP system. They are almost always extremely brief and non-disruptive to the operating environment. However, when their duration becomes excessive, spin loops become a problem which requires recovery action to resolve. In the prior art, those actions were determined and performed by the system operator.

Excessive spin loop (ESL) conditions can be triggered for a wide variety of causes. For example, the CP which is holding a resource required by the routine spinning on another CP may be:
o Experiencing a hardware failure
o Experiencing a software failure
o Performing a critical function which takes an unusually long period of time to complete
o Stopped by the operator or by the operating system

In the past, the MVS operating system detected the existence of an ESL and surfaced the condition to the system operator. The detection was performed by the routine in the spin loop, after spinning for a full ESL timeout interval, which was approximately 40 seconds in MVS. It then invoked the Excessive Spin Notification Routine to issue a message to the operator requesting recovery action.
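
The detection step performed by the spinning routine can be sketched as follows. This is a minimal illustration only, assuming a Python `threading.Lock` as a stand-in for a spin-type system lock; the function name `spin_for_lock`, its parameters, and the callback are hypothetical and are not MVS interfaces.

```python
import threading
import time


def spin_for_lock(lock, esl_interval, on_excessive_spin, clock=time.monotonic):
    """Spin until `lock` is obtained.

    Each time a full ESL interval passes without obtaining the lock,
    `on_excessive_spin` is called (standing in for operator notification
    or recovery processing) and the interval timer restarts.
    Returns the number of excessive-spin detections.
    """
    detections = 0
    deadline = clock() + esl_interval
    while not lock.acquire(blocking=False):  # periodically re-request the lock
        if clock() >= deadline:
            detections += 1
            on_excessive_spin()
            deadline = clock() + esl_interval
    return detections
```

An uncontended request returns with zero detections; a request against a held lock invokes the callback once per elapsed interval until the lock is freed.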

Determination of the correct recovery action to resolve an ESL condition is complex, error-prone, and especially critical given the severe impact such a condition has on the operating system. Due to the frequency of inter-processor communication and cross-CP resource serialization in an MP environment, when one CP fails, all other CPs very quickly enter spin loops until the problem on the failing CP is resolved.

According to the prior art, there were three recovery actions that an operator could take when an ESL occurred. Each has benefits and drawbacks associated with it. The actions are as follows:

1. Respond to the ESL message to continue to spin on the detecting CP for another excessive spin loop interval.

This will only have benefit if the cause of the spin loop is temporary, i.e., if it is due to some unusually lengthy but legitimate processing on the CP causing the condition.

The problem here is that neither the operator nor MVS knows whether the condition is temporary or not. If the operator does not respond to continue the spin and instead performs a recovery action, the possibility exists that an important MVS system function will be the target of that destructive recovery action. This may even result in an unnecessary system crash.

On the other hand, if the operator does decide to continue the spin, how many times should the spin be allowed to repeat before taking a more forceful action? Each response to continue in the spin loop further prolongs the time that the system is unavailable.

2. Respond to the ESL message to trigger the MVS Alternate CP Recovery (ACR) function for the failing CP. The general ACR function is described in IBM Technical Disclosure Bulletin, Nov. 1973, Volume 16, No. 6, at page 2005. The algorithm used to determine which is the failing CP in an N-way environment is described in IBM Technical Disclosure Bulletin, July 1983, Volume 26, No. 2, at page 748.

This causes the recovery routines protecting the program running on the failing CP to be invoked. This is done to allow the recovery routines to release resources held on the failing CP which may be required by the CP currently in a spin loop.

The drawback of this action is that it also results in removing the "failing" CP from use by the MVS operating system. Experience has shown that excessive spin loops are usually caused by non-CP related hardware or software errors. The recovery processing associated with ACR may resolve the spin loop, but removing the CP from the configuration is highly disruptive and also unnecessary in the majority of spin loop scenarios.

Even with a highly-skilled operator, who determines and performs each recovery action after only 30 seconds delay, the system is completely unavailable for several minutes. In addition, the CP is unnecessarily removed from system use for an undetermined period of time.

Another drawback of the ACR action can be that recovery routines are allowed to retry after being invoked. Therefore, the ability of the ACR action to resolve the spin loop and avoid a system outage is highly dependent on the effectiveness of the recovery routines protecting the failing program. If the recovery routines do not release the resources required by the CP in the spin loop, or retry back to a point in the failing program which caused the problem to begin with, the spin loop condition will not be resolved.

3. Respond to the ESL message to continue the spin on the detecting CP AND initiate a RESTART from the system console to interrupt the routine executing on the failing CP. This action will trigger invocation of recovery routines to force the release of resources held on the failing CP.

The drawback of this action is that it results in termination of the current unit of work because recovery routines are not allowed to retry when RESTART is invoked. Thus, even though the recovery routines may be able to successfully resolve the problem causing the spin loop, the program is forced to terminate. If a critical job or subsystem is active on the failing CP when the spin loop is detected, invocation of RESTART will cause loss of that critical subsystem and perhaps require re-IPL of the system.

Another drawback is that the RESTART procedure is more complicated than simply responding to a message and is therefore more prone to operator error.

Most ESL conditions, due to operator error or inadequate recovery options, end with a system crash and an extended outage requiring re-IPL.

In addition to the complexities of the recovery decisions required by the operator to recover from an ESL condition, the mechanics of effecting the recovery become significantly more involved if the operator is unable to answer the spin loop message and instead must respond to the spin loop restartable wait state. For example, for an ACR response, the operating procedure involves:

1. Stopping all CPs in the system

2. Storing the ACR response in main storage on the detecting CP (which may be in violation of the installation's policies)

3. Starting all the CPs except the detecting and failing CPs

4. Restarting the detecting CP to initiate recovery.



SUMMARY OF THE INVENTION 

The present invention is a system and process in a multiprocessor system environment, for detecting and taking steps to automatically recover from excessive spin loop conditions. It comprises functions and supporting indicators that clearly identify true spin loop situations, and present a hierarchical series of recovery actions, some new to the ESL environment, that minimize the impact of the condition to the multiprocessor system and its workload.

It is an object of the present invention to provide an automatic and efficient mechanism for detecting and recovering from excessive spin loop situations in an MP environment.

It is a further object of this invention to recognize persistent, related spin loop situations in an MP environment, and recover automatically from them. This includes recovering in parallel from multiple ESL occurrences involving more than one failing CP.

It is a further object of this invention to present a hierarchy of recovery actions representing progressively more severe actions, so that a severe action is taken only when a less severe action has failed to resolve the problem.



BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a linear time flow diagram showing an overview of the Excessive Spin Loop Recovery (ESLR) Function operating in a 2-way MP environment.

Fig. 2 is a function flow diagram outlining Excessive Spin Loop Recovery processing.

Fig. 3 is a function flow diagram showing the hierarchy of recovery actions taken within ESLR processing.

Fig. 4 is a linear time flow diagram showing a scenario in which ESLR processing is used to resolve a spin loop deadlock situation in a 5-way MP environment.

Figure 1 shows an environment in which an embodiment of the present invention operates. It illustrates a 2-way MP system consisting of Central Processor 1 (10) and Central Processor 2 (11). Central Processor 1, having obtained spin-type lock x at time t0 (101), subsequently enters a disabled loop (102). Central Processor 2, requesting spin lock x at time t0 + 1 (110), is unable to obtain it, and so "spins", periodically re-requesting the lock (111).

As with systems of the prior art, it is the responsibility of the processes which have requested a spin-type lock to determine that a "long" time has elapsed since the lock was requested (a time interval referred to as the ESL, or Excessive Spin Loop, interval). Having recognized that this period of time has elapsed (112), the requesting processor invokes the Excessive Spin Loop Recovery (ESLR) processing of this invention (113). This processing ultimately results in the release of the lock by processor 1 (103), and allows the subsequent acquisition of the lock by processor 2 (114).

Referring to figure 2, excessive spin loop recovery processing is entered when the CP requesting the lock detects that it has been waiting for the lock for an excessive amount of time. On entry, this routine checks to determine whether excessive spin loop recovery processing is active on any other CP in the complex by checking the CVT global control block (24) via the atomic "Test and Set" operation. If the answer is yes, there is an immediate return and this indication is not treated as a detection of an excessive spin loop.
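
The entry check can be illustrated with a short sketch. It assumes that a non-blocking `threading.Lock.acquire` is an acceptable stand-in for the atomic "Test and Set" on the CVT indicator; the class `EslrGate` and the function `eslr_entry` are hypothetical names introduced only for this illustration.

```python
import threading


class EslrGate:
    """Stand-in for the 'ESLR active' indicator in the global control block."""

    def __init__(self):
        # A non-blocking acquire emulates an atomic test-and-set of one bit.
        self._flag = threading.Lock()

    def try_enter(self):
        """Atomically test and set the indicator; True means the caller owns ESLR."""
        return self._flag.acquire(blocking=False)

    def leave(self):
        """Reset the indicator when ESLR processing is complete."""
        self._flag.release()


def eslr_entry(gate, recover):
    """Run `recover` only if no other CP is already inside ESLR."""
    if not gate.try_enter():
        return "returned immediately"  # not treated as an ESL detection
    try:
        recover()
        return "processed"
    finally:
        gate.leave()
```

A caller that finds the indicator already set simply resumes spinning and will call again after its next ESL interval.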

If the answer is no, the failing CP is identified as indicated in the aforementioned TDB (Vol. 26, No. 2, at p. 748), and the identity of the failing CP is saved. A check is then made to see whether any spin loop recovery action was taken for this failing CP within the last excessive spin loop interval. If so, subsequent recovery processing is bypassed. In tightly-coupled MP systems of three or more CPs, this is done because two different CPs could enter ESLs against the same failing CP within the same interval. When the first of these two ESLs results in a recovery action, the second ESL must be prevented from initiating another (more disruptive) action before the first one has a chance to complete.

The Excessive Spin Loop Recovery Processor (ESLR) maintains a table in global storage showing the time of the last ESL recovery action taken against each CP. This Last Action Taken (LAT) Table (25) has one entry per CP. ESLR then compares the clock value on entry with the LAT entry for the failing CP. If an ESL interval has not passed since the last action against this failing CP, no action is taken. However, the last detection time LASTDT (28) field is updated because this detection must be recorded to ensure the proper determination of a persistent problem. The clock value is again obtained and then stored in the global ESL field (28), indicating that this detection is treated as a global detection, and the routine returns to the caller.

If no action was taken for this CP within the last ESL interval, a check is made to see if an ESL was detected against any CP within the last two ESL intervals (23).

The question here is whether two consecutive ESL occurrences represent repeated manifestations of the same problem (i.e., a persistent problem) or whether each ESL occurrence represents a separate problem. If an ESL is identified as occurring for a persistent problem, the recovery action for that ESL will be the next one in the series of increasingly severe actions for that particular failing CP.

If an ESL is determined to be the initial manifestation of a problem, all the ESL indicators for all CPs are reset so that any sequence of actions for any CP starts at the first action.

The Excessive Spin Loop Recovery Processor (ESLR) maintains a field (LASTDT) (28) in global storage showing the time of the last detection of an ESL against ANY CP.

A persistent problem exists if: T - LASTDT < 2 x ESLI where:

T = time of this entry to the ESL Recovery routine
ESLI = excessive spin loop interval.

When processing of this ESL is complete, LASTDT is updated with the current time at exit from the ESLR process.

Given that the time between entries to ESLR from a given spin routine is equal to ESLI plus a very small delta consisting of linkage time from the spin routine to the ESLR process, it follows that the spin routine will continue to call ESLR in less than two spin loop time-out intervals until it has obtained its required resource. However, a given invocation of ESLR may be locked out if another CP has already serialized the ESLR function. Therefore, ESLR must be cognizant of all entries to ESLR from any CP. If no entry to ESLR occurs from any CP for two or more spin loop time-out intervals, then it follows that ALL spinning routines obtained ALL their desired resources subsequent to the last call to ESLR.
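
The persistence test T - LASTDT < 2 x ESLI can be written as a one-line predicate. The sketch below is illustrative only; the function name `is_persistent` and the use of `None` for "no detection recorded yet" are assumptions, and the sample times are taken from a 10-second ESL interval.

```python
def is_persistent(t_entry, lastdt, esli):
    """Return True if this detection continues an existing problem.

    Implements T - LASTDT < 2 x ESLI, where `t_entry` is the time of entry
    to ESLR, `lastdt` is the time of the last detection against any CP
    (None if there has been none), and `esli` is the ESL interval.
    """
    return lastdt is not None and (t_entry - lastdt) < 2 * esli


# With ESLI = 10 s: a detection 7 s after the last one continues the problem,
# while a detection a full two intervals later starts a new sequence.
continuing = is_persistent(20, 13, 10)   # 7 < 20
fresh = is_persistent(30, 10, 10)        # 20 < 20 is false
```

The strict inequality matters: a gap of exactly two intervals already means that every spinning routine obtained its resource, so the hierarchy restarts at the first action.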
The next check is a determination whether the failing CP is in fact executing a routine that is exempted from excessive spin loop recovery processing (indicated in the LCCA block (27)). A mechanism for providing such an exemption is required because there are legitimate system routines which could otherwise trigger ESL conditions because the time to complete the function exceeds the ESL time-out value. This allows the system routines to set an indicator around the lengthy function in a field checked by the ESL recovery process. This exemption mechanism allows the ESL interval to be reduced far below its value in previous MVS systems of 40 seconds to significantly improve ESL recovery performance. It eliminates the need to spin for such long periods to avoid an ESL detection and recovery action for a legitimate, temporary condition. Some MVS functions included in this validly exempted category are those which load restartable CP wait states for operator communication, place a CP temporarily in a stopped state, or communicate with the operator via the disabled console communication facility.

If the failing CP is not executing an exempt routine, recovery action is initiated for that failing CP. This recovery action processing is further described in Figure 3. Having taken the appropriate recovery action, the current clock value is placed in the LAT field (26) of the failing CP and the global ESL field (LASTDT (28)) and return is made to the caller.
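
One way to picture the exemption indicator is as a flag that a lengthy routine sets around its critical section and that ESLR consults before acting. The sketch below is a simplified illustration; the class `Lcca`, the field `esl_exempt`, and both helper functions are hypothetical names, not the actual LCCA layout.

```python
from contextlib import contextmanager


class Lcca:
    """Hypothetical per-CP area holding the ESL exemption indicator."""

    def __init__(self):
        self.esl_exempt = False


@contextmanager
def esl_exempt(lcca):
    """Mark the routine on this CP as exempt for the duration of the block."""
    previous = lcca.esl_exempt
    lcca.esl_exempt = True
    try:
        yield
    finally:
        lcca.esl_exempt = previous  # clear the indicator when the function ends


def should_recover(target_lcca):
    """ESLR takes recovery action against a CP only if it is not exempt."""
    return not target_lcca.esl_exempt
```

Because only explicitly marked routines are protected, the ESL interval itself can be kept short without penalizing legitimate long-running functions.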

Referring to Figure 3, on entry to recovery action processing an index associated with the failing CP is incremented. A check is then made against the value of the index. If the value equals 1, a return is made to the caller. This results in a continuation of spinning on the desired lock for another ESL interval. It is important to wait for this additional ESL interval since it is possible that a call may have been made to excessive spin loop recovery processing in the window of time between the clearing of the exemption flag and the enabling of the associated CP, and in this case no disruptive recovery action is desired.

If the index is equal to 2, an indicator is set in the CVT control block indicating ABEND as the recovery action. A Signal Processor instruction indicating restart is then issued to the failing CP to give control to the restart FLIH. Return is then made to the caller. On the failing CP the RESTART FLIH checks the CVT indicator and sets a flag indicating the ABEND action and passes control to the Recovery Termination Manager to execute the ABEND action, which allows the recovery routines to retry after performing any necessary clean up.

If the index is equal to 3, the CVT flag is set to indicate the TERMINATE recovery option. A Signal Processor instruction indicating restart is then issued to the failing CP to cause the Recovery Termination Manager to begin termination processing. The TERMINATE option differs from the ABEND option in that it does not allow recovery routines to retry. Resources owned by the failing unit of work are released, and the unit of work is forced to terminate. Return is then made to the caller.

If the index is equal to 4, Alternate CP Recovery (ACR) is initiated for the failing CP. This initiation is effected by the detecting processor simulating the receipt of a malfunction alert interruption from the failing CP, which initiates actions resulting in taking this CP off-line.
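
The index-driven selection of recovery actions can be sketched as a small dispatch. This is an illustration under assumptions: the names `Action` and `next_action` are hypothetical, and holding the index at ACR once the fourth level is reached is a choice made here, since the text does not describe an index above 4.

```python
from enum import Enum


class Action(Enum):
    SPIN = 1        # return to caller; spin for another ESL interval
    ABEND = 2       # abnormal termination, recovery routines may retry
    TERMINATE = 3   # termination without retry
    ACR = 4         # Alternate CP Recovery; failing CP taken off-line


def next_action(index_by_cp, failing_cp):
    """Increment the failing CP's index and return the matching action."""
    index = index_by_cp.get(failing_cp, 0) + 1
    index_by_cp[failing_cp] = index
    # Assumption: stay at ACR if the index would exceed the last level.
    return Action(min(index, Action.ACR.value))
```

Each failing CP carries its own index, so escalation against one CP does not advance the hierarchy for another.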

5-WAY EXAMPLE

Figure 4 illustrates Excessive Spin Loop Recovery processing active in a 5-WAY MP system, with two independent excessive spin loops: the first involves CPs 0, 1 and 2 all waiting for a resource held by failing CP 3; the second involves CP 4 waiting for a resource held by failing CP 5. The example shows:

1. Simultaneous resolution of independent ESLs

2. Correct progression through the hierarchy of recovery actions for each ESL, taking increasingly severe action when previous action failed to resolve the problem.

3. Pacing of actions taken for related ESLs (multiple CPs spinning on the same failing CP).

At times T, T+2, and T+3, the waiting CPs (0, 1, 2 and 4) request the needed resource of CP 3 or 5. At T+10, CP 0, noticing that an ESL interval (here, 10 seconds) has elapsed without obtaining the resource, calls ESLR processing, which sets the CP 3 index to 1 and saves the time of this ESLR processing (T+10.1) in the LAT field for CP 3 (figure 2B at 26), and LASTDT (28), and then returns to the caller who continues to spin (as indicated in figure 3, since this is the initial detection). At T+12, CP 4 detects an ESL, calls ESLR, which sets the CP 5 index to 1 and saves the time (T+12.1) in the LAT entry for CP 5 (26) and LASTDT (28), and then continues to spin (fig. 3). Simultaneously at T+12, CP 1 detected an ESL and invoked ESLR, which immediately returned since ESLR was already active on CP 4 (see fig. 2A at 21). At T+13, CP 2 detected its ESL and called ESLR, which takes no recovery action since one was taken for this failing CP (CP 3) within the last ESL interval (see fig. 2A at 22). The time (T+13.1) is saved in LASTDT (28).

At T+20.1, another ESL interval having passed for CP 0, ESLR is again invoked; since no action was taken for failing CP 3 within the last ESL interval (T+10.1 - T+20.1) (see fig. 2A at 22), a recovery action is taken, the index for CP 3 is incremented to 2 (fig. 3 at 31), and the ABEND is signalled to CP 3 (32). The time (T+20.2) is saved in LAT for CP 3 (26), and LASTDT (28). At T+22, CP 1 again detects the expiration of another ESL interval, calls ESLR, which takes no action since action was taken for CP 3 within the past ESL interval (fig. 2A at 22). The time (T+22.1) is saved in LASTDT (28). Also at T+22.1, CP 4 detects the expiration of an ESL interval, calls ESLR, which immediately returns since ESLR is already running on CP 1 (fig. 2A at 21). At T+23.1, CP 2 notes the passing of an ESL interval, calls ESLR, which takes no action since action was taken for CP 3 within the last ESL interval (fig. 2A at 22). The time (T+23.2) is saved in LASTDT (28).

At time T+30.2, CP 0 detects the passage of another ESL interval (the ABEND signalled to CP 3 at T+20.2 has not resolved the problem on CP 3), calls ESLR, which, since no action was taken for CP 3 within the last ESL interval, increments CP 3's index (fig. 3 at 31) to 3, then signals "Terminate" to CP 3 (33). Time (T+30.3) is saved in the LAT entry for CP 3 (26) and in LASTDT (28). Note that in this example, the Terminate action against the unit of work on CP 3 resolves the spin loop on CPs 0, 1 and 2.

At T+32.1, CP 4, detecting the expiration of another ESL interval (T+22.1 - T+32.1), calls ESLR. ESLR, realizing that no action was taken for CP 5 within the last ESL interval (T+22.1 - T+32.1; LAT for CP 5 is T+12.1), but there was an ESL detected against some CP within the last two ESL intervals (fig. 2A at 23), increments the index associated with CP 5 to 2 (fig. 3 at 31) and signals ABEND to CP 5 (32). The time (T+32.2) is saved in LAT for CP 5 (26), and in LASTDT (28). In the example, the ABEND action against the unit of work on CP 5 resolves the spin loop on CP 4.
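
The sequence of decisions in this example can be replayed with a small model of the ESLR state. The sketch is a simplification under stated assumptions: times are integers in tenths of a second measured from T, ESLR processing is assumed to take 0.1 second (so the saved time is the entry time plus one tenth), the "ESLR already active" condition is supplied as a flag on each event rather than simulated, and no CP is exempt. All names (`EslrState`, `eslr`, `EVENTS`, `replay`) are hypothetical.

```python
ESLI = 100  # ESL interval in tenths of a second (10 s in the example)
ACTIONS = ["SPIN", "ABEND", "TERMINATE", "ACR"]


class EslrState:
    def __init__(self):
        self.lat = {}       # Last Action Taken time, per failing CP
        self.index = {}     # position in the hierarchy, per failing CP
        self.lastdt = None  # time of the last detection against any CP


def eslr(state, now, failing_cp, eslr_busy=False):
    """Return the outcome of one entry to ESLR at time `now`."""
    if eslr_busy:
        return "LOCKED OUT"            # ESLR active on another CP
    exit_time = now + 1                # assumed 0.1 s of ESLR processing
    last_action = state.lat.get(failing_cp)
    if last_action is not None and now - last_action < ESLI:
        state.lastdt = exit_time       # pacing: record the detection only
        return "NO ACTION"
    persistent = state.lastdt is not None and now - state.lastdt < 2 * ESLI
    if not persistent:
        state.index.clear()            # new problem: restart every sequence
    idx = state.index.get(failing_cp, 0) + 1
    state.index[failing_cp] = idx
    state.lat[failing_cp] = exit_time
    state.lastdt = exit_time
    return ACTIONS[min(idx, len(ACTIONS)) - 1]


# (entry time, detecting CP, failing CP, ESLR busy on another CP)
EVENTS = [
    (100, 0, 3, False), (120, 4, 5, False), (120, 1, 3, True),
    (130, 2, 3, False), (201, 0, 3, False), (220, 1, 3, False),
    (221, 4, 5, True), (231, 2, 3, False), (302, 0, 3, False),
    (321, 4, 5, False),
]


def replay(events):
    state = EslrState()
    return [eslr(state, now, failing, busy) for now, _cp, failing, busy in events]
```

Under these assumptions, `replay(EVENTS)` yields SPIN for the first detections against CP 3 and CP 5, ABEND and then TERMINATE against CP 3, and ABEND against CP 5, with the intermediate calls either locked out or paced.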



Claims 

1. In a multiprocessing system complex comprising at least two processors, an operating system, and resources shared among processors, a method for recognition of and recovery from excessive spin loops by the operating system comprising:

A) detecting, by a detecting routine in a first processor, that said first processor has been in a spin loop requiring a resource held by a resource-holding routine in another processor for a fixed time period;

B) identifying a target processor in said system complex as a target for responsive recovery action;

C) performing no responsive recovery actions if a bypass indicator set by a routine in said second processor so indicates;

D) automatically performing for said target processor one of a hierarchical sequence of responsive programmed recovery actions if said bypass indicator is off;

E) continuing to identify said target and to perform subsequent hierarchical recovery actions for said target processor until said target processor is no longer so identified as said target;

F) continuing to so detect the holding of any of said resources for said fixed time period and to identify target processors and perform target processor-specific hierarchical recovery actions until all of said resources are acquired by all detecting processors.

2. The method of claim 1 in which a subsequent one of said recovery actions in said hierarchical sequence is performed for said target processor only if an immediately preceding one of said hierarchical recovery actions has been performed for said target processor longer ago than one of said fixed time periods.

3. The method of claim 2 in which said subsequent action in said hierarchical sequence is performed if there has been said detecting of one of said spin loops requiring one of said resources held by any of said processors in said multiprocessing complex within two of said fixed time periods, and in which an initial one of said hierarchical actions is performed otherwise.

4. The method of claim 3 in which said hierarchical sequence comprises the action of abnormally terminating said routine in said target processor in a manner that permits a resource-holding routine in said target processor to resume normal execution after cleanup.

5. The method of claim 3 in which said hierarchical sequence comprises the actions of:

A) continuing to wait for said resource to be released for a second fixed time period;

B) abnormally terminating a resource-holding routine in said target processor in a manner that permits said routine in said target processor to resume normal execution after cleanup;

C) terminating said resource-holding routine in said target processor in a manner that does not permit said routine in said target processor to resume normal execution;

D) removing said target processor from said multiprocessor system complex.

6. The method of claim 3 in which said hierarchical sequence comprises the following actions, in the order listed:

A) continuing to wait for said resource to be released for a second fixed time period;

B) abnormally terminating said resource-holding routine in said target processor in a manner that permits said routine in said target processor to resume normal execution after cleanup;

C) terminating said resource-holding routine in said target processor in a manner that does not permit said routine in said target processor to resume normal execution;

D) removing said target processor from said multiprocessor system complex.

7. In a multiprocessing system complex comprising at least two processors, an operating system, and resources shared among processors, a mechanism for recognition of and recovery from excessive spin loops by the operating system comprising:

A) detection means for detecting that a first processor has been in a spin loop requiring a resource held by a routine in a second processor for a fixed time period;

B) identification means for identifying a target processor in said system complex as a target for responsive recovery action when said detecting means detects said spin loop;

C) a processor-bypass indicator associated with each of said processors and having an "on" setting and an "off" setting, said bypass indicator being set to said "on" setting when an exempt routine is executing in said processor associated with said "on" bypass indicator;

D) responsive recovery means for freeing said resource held by said target processor only if said processor-bypass indicator associated with said target processor is "off".

8. The mechanism of claim 7 in which said responsive recovery means comprises a hierarchical set of recovery functions, which further comprise an ABEND-triggering function for causing a resource-holding routine executing in said target processor to abnormally terminate, allowing retry.

9. The mechanism of claim 7 in which said responsive recovery means comprises a hierarchical set of recovery functions, said functions comprising:

A) a spin function for permitting said first processor to remain in said spin loop for a second fixed time period;

B) an ABEND-triggering function for causing a resource-holding routine executing in said target processor to abnormally terminate, allowing retry;

C) a TERMINATE-triggering function for causing a resource-holding routine executing in said target processor to terminate without retry;

D) an ACR function for removing said target processor from said multiprocessor system complex.

10. The mechanism of claim 9 further comprising means for causing successive detections of said spin loop fixed time periods resulting in identification of the same target processor or a different target processor to cause invocation of one of said recovery functions, said recovery functions [...] invoked in the order A, B, C, D [...] target processor, if one of said functions [...] invoked less recently than said fixed time [...] said identified target processor.

11. The mechanism of claim 10 further comprising means for causing a successive [...] of said spin loop fixed time period [...] detection to invoke a sequential recovery [...] for said identified target processor when [...] successive detection occurs within 2 fixed time intervals of said prior detection.